5   DMA COPROCESSOR

 

 

 

5.1 DESIGN OF THE DMA COPROCESSOR

 

 

 

COPROCESSOR PHILOSOPHY

 

 

A coprocessor is a slave CPU which computes in parallel with the master CPU.  Floating point operations are often handled by a coprocessor to free up the master CPU for other work.  Many applications require high speed I/O with minimum CPU overhead.  Special DMA, video display, and network control chips are usually used.  These chips add expense and complexity to the application.  ShBOOM provides a coprocessor to dispatch input/output with the off-chip world with minimum interaction from the master CPU.  This unit is called the DMA Coprocessor.

 

On the same die with the ShBOOM CPU, the DMA Coprocessor acts as a loosely coupled slave processor.  The coprocessor shares memory space with the CPU but has its own program counter and instruction set.  Contrary to the conventional coprocessor design, the DMA Coprocessor always has bus priority so it can dispatch input/output in a time-predictable manner with minimum master CPU overhead.

 

Coprocessor I/O devices are memory mapped in the ShBOOM memory space.  Simple decoding of high order address bits output by the Transfer Address Register (see below) selects a specific device.

 

 

THE DMA COPROCESSOR

 

 

The DMA Coprocessor is a self-contained CPU which fetches and executes a special instruction set from memory shared with the ShBOOM CPU.  The instructions move data between the specified I/O devices and ShBOOM dynamic RAM.  The coprocessor continuously executes instructions, and may run without interaction from the ShBOOM CPU.

 

The coprocessor has a 20-bit program counter and the coprocessor memory space overlaps the ShBOOM CPU space starting at location zero.  An important feature of the coprocessor is that its input/output transfers are predictable in time.  The coprocessor takes precedence over any other activity (interrupt, master CPU data fetch, or instruction prefetch) which might request the system bus.

 

The coprocessor is therefore very useful for generating and interpreting video timing signals, disk drive interfacing, and network signals which must happen at precise intervals.  In a ShBOOM system, the coprocessor can perform tasks including DRAM refresh, high quality sound generation, MODEM signal generation and reception, or multiprocessor communications.

 


5.2 DMA COPROCESSOR REGISTERS

 

 

Figure 5.1 shows the coprocessor block diagram and registers.  Requests for the bus from the ShBOOM CPU must pass through the coprocessor.  This allows the coprocessor to maintain the highest priority for bus access.

 

 

CPC (COPROCESSOR PROGRAM COUNTER)

 

The 20-bit Coprocessor Program Counter points to instructions which reside in the first 1 megabyte of the ShBOOM memory.  (Since the CPC counts in increments of 32-bit instructions, the two least significant bits of the CPC are always zero.)  A coprocessor program may be as short as a single 32-bit instruction (JUMP to itself) or as long as the entire 1 megabyte coprocessor instruction space.

 

The CPC is incremented by four after execution of each instruction and can be loaded by a JUMP instruction or by a write operation from the ShBOOM CPU.

 

 

FLAG

 

FLAG is a single bit register which can be polled by the ShBOOM CPU to determine when a specific coprocessor JUMP has executed.  FLAG is set or reset by the JUMP instruction.

 

 

TIR (TRANSFER INTERVAL REGISTER)

 

The TIR is a 12-bit register which counts the number of system clock pulses between timed input/output transfers.  For example, while a video display is drawing a scan line, the TIR would hold the time delay corresponding to the pixel rate divided by the 32 bits moved in each transfer.

 

The TIR is a write only register and can be loaded by a LOAD or MOVE instruction.  It is decremented by each system clock.  The TIR cannot be read or written by the ShBOOM CPU.

 

 

TSC (TRANSFER SIZE COUNTER)

 

The TSC is a 12-bit register which counts the number of input/output transfers scheduled by a LOAD or MOVE instruction.  It is decremented each time an input/output transfer is performed.  The TSC is write only. The TSC cannot be read or written by the ShBOOM CPU.

 

 

TAR (TRANSFER ADDRESS REGISTER)

 

The TAR is a 32-bit register which points to the I/O memory source or destination scheduled by a LOAD or MOVE instruction.  It is incremented each time an input/output transfer is performed.  The TAR is write only.  The TAR cannot be read or written by the ShBOOM CPU.

 

 

 

 


5.3 DMA COPROCESSOR INSTRUCTIONS

 

 

 

The coprocessor fetches and executes the following 32-bit instructions:

 

            LOAD

            MOVE-RAM-I/O

            MOVE-I/O-RAM

            REFRESH

            READ&PACK

            TEST-WAIT-WRITE

            TEST-WAIT-READ

            TEST-WRITE

            TEST-READ

            JUMP

 

Bits of the Binary Op Codes are interpreted as follows:

 

        R   - Refers to a bit destined for the Transfer Address Register

        S   - Refers to a bit destined for the Transfer Size Counter

        X   - A bit which sets or resets the FLAG register

        Y   - A bit which is ORed with its corresponding I/O Register bit and

                   sent to a corresponding output latch

        N   - Causes an interrupt if set to "1" in a JUMP

        P   - Refers to a bit destined for the Coprocessor Program Counter

        I   - Refers to a bit destined for the Transfer Interval Register

 

 

 

 

 

LOAD

 

 

 

 

Binary Opcode            RRRR RRRR RRRR RRRR RRRR RRRR RRRR RR10

 

 

 

The LOAD instruction loads the value specified by (RRRR...) into the Transfer Address Register.  
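
The LOAD bit layout above can be illustrated with a small encoder.  This is a minimal sketch, not a definitive assembler: it assumes that the 30 R bits supply the upper 30 bits of the 32-bit Transfer Address Register and that the two low address bits are zero (a word-aligned address).  The manual does not spell out that mapping, and the function names are hypothetical.

```python
def encode_load(address: int) -> int:
    """Build a LOAD word: R bits in bits 31..2, fixed '10' in bits 1..0.

    Assumption: the R field holds the upper 30 bits of a word-aligned
    32-bit address destined for the Transfer Address Register.
    """
    if not 0 <= address < 2**32:
        raise ValueError("address must fit in 32 bits")
    if address & 0x3:
        raise ValueError("address must be 32-bit word aligned")
    return address | 0b10


def decode_load(word: int) -> int:
    """Recover the address from a LOAD word built by encode_load."""
    if word & 0x3 != 0b10:
        raise ValueError("low two bits are not the LOAD pattern '10'")
    return word & 0xFFFFFFFC


if __name__ == "__main__":
    print(f"{encode_load(0x1000):08X}")
```

Keeping the address in place and only setting the low two bits means the instruction word can be read directly as "address plus opcode" in a memory dump.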


MOVE-RAM-I/O

 

 

 

 

Binary Opcode           SSSS SSSS SSSS IIII IIII IIII 0110 0000

 

 

 

This instruction performs the following functions:

 

            Reads the number of 32-bit words specified by (SSSS...) from the memory location pointed to by the Transfer Address Register,

            Waits the interval of system cycles specified by (IIII...) between transfers,

            Writes the number of 32-bit words specified by (SSSS...) to the selected I/O device.
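
The size and interval fields of this opcode can be packed as sketched below.  The sketch assumes both 12-bit fields are plain unsigned binary values; the manual does not say whether a field value of zero means zero or 4096, so the code simply stores the number given.  The constant and function names are hypothetical.  The low eight bits are taken from the opcodes listed in this section.

```python
# Low 8 bits of the timed-transfer opcodes, as listed in this section.
MOVE_RAM_IO = 0b0110_0000
MOVE_IO_RAM = 0b0100_0000
REFRESH = 0b0110_1100
READ_PACK = 0b0000_0000


def encode_timed(opcode: int, size: int, interval: int) -> int:
    """Pack SSSS (bits 31..20), IIII (bits 19..8) and the opcode (bits 7..0)."""
    if not 0 <= size <= 0xFFF:
        raise ValueError("size must fit in the 12-bit Transfer Size Counter")
    if not 0 <= interval <= 0xFFF:
        raise ValueError("interval must fit in the 12-bit Transfer Interval Register")
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode must fit in 8 bits")
    return (size << 20) | (interval << 8) | opcode


if __name__ == "__main__":
    # 20 words from RAM to the I/O device, 32 system clocks apart.
    print(f"{encode_timed(MOVE_RAM_IO, 20, 32):08X}")
```

The same field layout is shared by MOVE-I/O-RAM, REFRESH, and READ&PACK, so one helper covers all four.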

 

 

 

 

 

MOVE-I/O-RAM

 

 

 

 

Binary Opcode          SSSS SSSS SSSS IIII IIII IIII 0100 0000

 

 

 

This instruction performs the following functions:

 

            Reads the number of 32-bit words specified by (SSSS...) from the selected I/O device,

            Waits the interval of system cycles specified by (IIII...) between transfers,

            Writes the number of 32-bit words specified by (SSSS...) to the memory location pointed to by the TAR.

 


REFRESH

 

                 

 

 

Binary Opcode            SSSS SSSS SSSS IIII IIII IIII 0110 1100

 

 

 

 

This instruction does the following:

 

            Performs the number of CAS-before-RAS refresh cycles specified by (SSSS...),

            Waits the interval of system cycles specified by (IIII...) between refresh cycles.

 

 

 

 

 

READ&PACK

 

 

 

 

Binary Opcode           SSSS SSSS SSSS IIII IIII IIII 0000 0000

 

 

 

 

This instruction does the following:

 

            Reads an 8-bit byte applied to the low order data lines of the selected I/O device,

            Waits the interval of system cycles specified by (IIII...),

            Repeats the process three more times, shifting each byte eight places to build a 32-bit value,

            Writes the 32-bit value to the location specified by the Transfer Address Register,

            Repeats the process the number of times specified by (SSSS...).

 

READ&PACK is performed automatically upon power-on reset to load the bootstrap program from a byte-wide PROM.  The instruction is also useful for interfacing with 8-bit peripherals such as HDTV A/D converters.
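
The byte packing performed by READ&PACK can be modelled in a few lines.  This is a sketch under one stated assumption: the manual says each byte is shifted eight places but does not give the direction, so the code below shifts left, which places the first byte read in the most significant position.  The interval wait between reads is omitted, and the function name is hypothetical.

```python
from typing import Iterable, List


def read_and_pack(byte_stream: Iterable[int], word_count: int) -> List[int]:
    """Pack groups of four 8-bit reads into 32-bit words.

    Assumption: the first byte read becomes the most significant byte.
    """
    source = iter(byte_stream)
    words = []
    for _ in range(word_count):
        value = 0
        for _ in range(4):
            # Shift the partial word eight places and merge the new byte.
            value = ((value << 8) | (next(source) & 0xFF)) & 0xFFFFFFFF
        words.append(value)
    return words


if __name__ == "__main__":
    prom = bytes([0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0])
    print([f"{w:08X}" for w in read_and_pack(prom, 2)])
```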


TEST-WAIT-WRITE

 

 

 

 

Binary Opcode   RRRR RRRR RRRR RRRR RRRR RRRR RRRR RR10

 

 

 

TEST-WAIT-WRITE tests the specified input line for a logic "0" level.  If the line is a logic "1", coprocessor execution halts until the line becomes a "0". 

 

When the line is pulled low to "0", the specified input port is read and the data is written to the address specified by (RRRR .....) in the instruction.  The address specifier (RRRR .....) bits of the instruction are incremented in memory.  If the least significant 12 bits of the address are zero, the instruction will act as a NOP when next executed.

 

 

 

 

 

TEST-WAIT-READ

 

 

 

 

Binary Opcode   RRRR RRRR RRRR RRRR RRRR RRRR RRRR RR10

 

 

 

TEST-WAIT-READ tests the specified input line for a logic "0" level.  If the line is a logic "1", coprocessor execution halts until the line becomes a "0".

 

When the line is pulled low to "0", the data at the address specified by (RRRR ......) is read and written to the specified output port.  The address specifier (RRRR .....) bits of the instruction are incremented in memory.  If the least significant 12 bits of the address are zero, the instruction will act as a NOP when next executed.


TEST-WRITE

 

 

 

 

Binary Opcode   RRRR RRRR RRRR RRRR RRRR RRRR RRRR RR10

 

 

 

TEST-WRITE tests the specified input line for a logic "0" level.  If the line is a logic "1", the instruction acts as a NOP.

 

If the line is pulled low to "0", the specified input port is read and the data is written to the address specified by (RRRR .....) in the instruction.  The address specifier (RRRR .....) bits of the instruction are incremented in memory.  If the least significant 12 bits of the address are zero, the instruction will act as a NOP when next executed.

 

 

 

 

 

TEST-READ

 

 

 

 

Binary Opcode   RRRR RRRR RRRR RRRR RRRR RRRR RRRR RR10

 

 

 

TEST-READ tests the specified input line for a logic "0" level.  If the line is a logic "1", the instruction acts as a NOP.

 

If the line is pulled low to "0", the data at the address specified by (RRRR .....) is read and written to the specified output port.  The address specifier (RRRR .....) bits of the instruction are incremented in memory.  If the least significant 12 bits of the address are zero, the instruction will act as a NOP when next executed.


JUMP

 

 

 

 

Binary Opcode           0000 0000 NYYX PPPP PPPP PPPP PPPP PP11

 

 

 

JUMP loads the Coprocessor Program Counter with the specified 18-bit value (PPPP...).  The next coprocessor instruction will be fetched from the specified word address.  The Coprocessor Flag Register is cleared to "0" if X=0 and set to "1" otherwise.  FLAG signals to the ShBOOM CPU that a JUMP has been executed.

 

For example, at the end of a video scan line, setting FLAG can inform the CPU to set up the next line.

 

If N = "1", the ShBOOM CPU will be interrupted when JUMP executes.  The CPU will push the current Program Counter onto the Return Stack and set the Program Counter to "0" and continue executing.

 

YY is ORed with bits 0 and 1 from the I/O register and then output to the I/O latch.  In this way, execution of the JUMP instruction can make one or both external lines go high.
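
The JUMP fields described above can be assembled as in the following sketch.  It assumes the 18 P bits occupy bits 19..2 of the word and map directly onto bits 19..2 of the 20-bit Coprocessor Program Counter, so a word-aligned byte address can be dropped into place unchanged.  That mapping is inferred from the opcode layout rather than stated in the text, and the function and parameter names are hypothetical.

```python
def encode_jump(target: int, flag: bool = False,
                interrupt: bool = False, io_bits: int = 0) -> int:
    """Build a JUMP word: 0000 0000 NYYX PPPP PPPP PPPP PPPP PP11.

    target    -- word-aligned byte address in the first 1 megabyte
    flag      -- X bit: value written to FLAG when the JUMP executes
    interrupt -- N bit: interrupt the ShBOOM CPU when the JUMP executes
    io_bits   -- YY bits, ORed with I/O register bits 0 and 1
    """
    if not 0 <= target < (1 << 20) or target & 0x3:
        raise ValueError("target must be word aligned and below 1 megabyte")
    if not 0 <= io_bits <= 0b11:
        raise ValueError("io_bits is a 2-bit field")
    return ((int(interrupt) << 23) | (io_bits << 21) |
            (int(flag) << 20) | target | 0b11)


if __name__ == "__main__":
    # A one-instruction program at address 0 that jumps to itself.
    print(f"{encode_jump(0):08X}")
    # Jump to 0x100, set FLAG, and interrupt the CPU.
    print(f"{encode_jump(0x100, flag=True, interrupt=True):08X}")
```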


5.4 PROGRAMMING CONSIDERATIONS

 

 

 

 

DRAM REFRESH

 

DMA Coprocessor instructions can take thousands or millions of cycles to execute.  A dynamic RAM must be entirely refreshed every 8 msec.  The system designer must make certain that provisions for distributed or lumped refresh have been made consistent with the specifications of the DRAM in use.
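
For distributed refresh, the spacing between refresh cycles follows from the 8 msec requirement and the number of rows in the DRAM.  The sketch below works that out in system clocks.  The 50 MHz clock and the 512-row DRAM in the example are illustrative assumptions only, not ShBOOM specifications, and the time taken by the refresh cycles themselves is ignored.

```python
TIR_MAX = 0xFFF  # largest value a 12-bit Transfer Interval Register can hold


def refresh_interval_clocks(clock_hz: int, rows: int,
                            period_us: int = 8000) -> int:
    """Largest whole number of system clocks between refresh cycles
    that still refreshes every row within the refresh period."""
    if clock_hz <= 0 or rows <= 0 or period_us <= 0:
        raise ValueError("all arguments must be positive")
    return (clock_hz * period_us) // (rows * 1_000_000)


if __name__ == "__main__":
    # Hypothetical system: 50 MHz clock, DRAM with 512 rows.
    interval = refresh_interval_clocks(50_000_000, 512)
    print(interval, "clocks; fits in TIR:", interval <= TIR_MAX)
```

If the computed interval exceeded the 12-bit range, the designer would pick a shorter interval and refresh more often than strictly required.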

 

In most I/O configurations, there are natural opportunities for refresh that may be easily exploited by the coprocessor.  In a video display, the time during horizontal retrace is suitable.  In a disk controller, the end of a sector read or write might be convenient for a burst refresh.

 

 

BUS BANDWIDTH CONSIDERATIONS

 

The most speed limiting resource of a computer system is memory bus bandwidth.  ShBOOM is very efficient with bus bandwidth by fetching four instructions in each memory cycle.  However, the DMA Coprocessor fetches instructions and data using the same bus. 

 

The coprocessor always maintains priority on bus availability so its I/O transfers can be precise.  The more frequently the coprocessor makes I/O requests, the less bandwidth is available for the ShBOOM CPU.  It is possible in an extreme case for the coprocessor to completely block all bus access.

 

In most applications, the coprocessor will spend very little bus time fetching and executing its instructions, since an instruction may take up to 16 million internal machine cycles to execute before the next bus cycle is required.
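
The "16 million" figure is consistent with the register widths given in Section 5.2.  Assuming a single instruction can schedule up to 2^12 transfers, each separated by up to 2^12 system clocks, the upper bound is 2^24 cycles.  The short calculation below is only a consistency check under that assumption, not a timing specification.

```python
def max_instruction_cycles(tsc_bits: int = 12, tir_bits: int = 12) -> int:
    """Upper bound on clocks consumed by one timed instruction, assuming
    the full range of the size counter times the full interval range."""
    return (1 << tsc_bits) * (1 << tir_bits)


if __name__ == "__main__":
    print(f"{max_instruction_cycles():,} cycles")
```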

 

For example, in a simple video application driving a 50 MHz display with a shift register, input/output transfers will consume about 9 per cent of the bus bandwidth.  After each I/O transfer, the ShBOOM CPU will probably have to initiate a memory RAS cycle to reset the fast page mode to point to the current program.  Systems such as hard disc or network controllers will use about 1 per cent of the bus bandwidth.
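
The share of the bus taken by display transfers can be estimated as transfers per second multiplied by the bus time of each transfer.  The sketch below assumes one bit per pixel and 32 bits per transfer, giving 1,562,500 transfers per second for a 50 MHz pixel rate.  The bus time per transfer is not stated in this section; the value of 57.6 nsec used in the example is simply the figure that reproduces the quoted 9 per cent under these assumptions.

```python
def transfers_per_second(pixel_hz: float, bits_per_pixel: int = 1,
                         bus_bits: int = 32) -> float:
    """Number of bus transfers needed each second to feed the display."""
    return pixel_hz * bits_per_pixel / bus_bits


def bus_fraction(pixel_hz: float, bus_time_ns: float,
                 bits_per_pixel: int = 1, bus_bits: int = 32) -> float:
    """Fraction of bus time consumed by display transfers."""
    rate = transfers_per_second(pixel_hz, bits_per_pixel, bus_bits)
    return rate * bus_time_ns * 1e-9


if __name__ == "__main__":
    print(f"{transfers_per_second(50e6):,.0f} transfers per second")
    # Assumed bus time per transfer; chosen to match the 9 per cent figure.
    print(f"{bus_fraction(50e6, 57.6):.1%} of the bus")
```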

 

 

INITIALIZATION

 

ShBOOM fetches and executes instructions from DRAM.  On power up, the first operation is to transfer the program to be executed from PROM to the DRAM.

 

When power is first applied or when the RESET line is pulled low, the following RESET sequence is performed to initialize the internal registers, to load the program from PROM into DRAM, and to begin execution:

 

            1. Block execution of the ShBOOM CPU,

            2. Clear the Coprocessor Program Counter to "0",

            3. Clear the Transfer Address Register to "0",

            4. Set the Transfer Size Counter to hex "FFF",

            5. Set the Transfer Interval Register to hex "7",

            6. Enable execution of the I/O Coprocessor, (The coprocessor will load 12 bytes from the external PROM into the lowest ShBOOM memory locations.)

            7. Set the ShBOOM CPU Program Counter to HEX byte "....0008",

            8. Enable execution of the ShBOOM CPU.

 

The second and third 32-bit instructions loaded from the PROM will place the coprocessor in a refresh loop.  The ShBOOM CPU will begin executing the code loaded by the coprocessor.  The CPU will usually direct the coprocessor to read in the rest of the PROM contents using READ&PACK.

 

 

COMMUNICATIONS WITH THE ShBOOM CPU

 

The coprocessor can be initialized by the ShBOOM CPU and then left to run independently.  The CPU directs the coprocessor by writing instructions into the shared memory.  The CPU can direct the coprocessor to begin executing a different program by writing a JUMP instruction to the new coprocessor program.

 

The coprocessor communicates to the ShBOOM CPU with the FLAG bit and by a dedicated interrupt.  The coprocessor JUMP instruction can set or reset the FLAG bit when the JUMP is taken.  This bit may then be polled by the CPU.

 

If the interrupt bit is set in the JUMP op-code, at the completion of the next 4-byte instruction group, the CPU will push the Program Counter on the Return Stack and JUMP to location "0" to execute the I/O coprocessor interrupt routine.     

 

 

TIMED DMA TRANSFERS

 

Transfers timed by the internal system clock allow the creation of streams of data at precise intervals or the sampling of streams of input at precise intervals.

 

A video signal is one of the most demanding timed DMA transfers.  The scan line pixels must be output precisely to avoid display distortion.

 

In another example, a sine wave can be generated by outputting a consecutive group of numbers to a Digital to Analog converter at precise intervals.  In fact any waveform can be reproduced in this fashion.
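
As an illustration of the waveform idea, the sketch below builds one period of sample values and computes how many system clocks should separate successive outputs for a chosen tone.  The 8-bit converter width, the 50 MHz clock, the 1 kHz tone, and the 64-sample table are all assumptions made for the example; none of them comes from the ShBOOM specification, and the function names are hypothetical.

```python
import math
from typing import List


def sine_table(samples: int, bits: int = 8) -> List[int]:
    """One period of a sine wave as unsigned values for a D/A converter."""
    if samples <= 0 or bits <= 0:
        raise ValueError("samples and bits must be positive")
    full_scale = (1 << bits) - 1
    return [round((math.sin(2 * math.pi * i / samples) + 1) / 2 * full_scale)
            for i in range(samples)]


def interval_clocks(clock_hz: float, tone_hz: float, samples: int) -> int:
    """System clocks between samples so the table repeats at tone_hz."""
    return round(clock_hz / (tone_hz * samples))


if __name__ == "__main__":
    table = sine_table(64)
    print(len(table), "samples, range", min(table), "to", max(table))
    # Hypothetical 50 MHz clock and 1 kHz tone.
    print("interval:", interval_clocks(50e6, 1000, 64), "clocks")
```

The interval would have to fit in the 12-bit Transfer Interval Register, which limits how low a tone a given table length can produce at a given clock rate.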

 

In a MODEM receiver application, timed DMA input transfers can sample an Analog to Digital converter at precise intervals and store the signal values for digital signal processing into binary data by the ShBOOM CPU.

 

The I/O Coprocessor takes precedence over all other bus activities, so that timed transfers will always be precise.

 

 

CLOCKED DMA TRANSFERS

 

Transfers clocked by one of eight external clock lines allow the transfer of streams of data upon demand of an external device.

 

A typical application of a clocked transfer is found in a disc drive.   When reading a sector, the disc would switch a clock line to indicate that data was ready to be read.  The coprocessor would then transfer the data directly into memory.

 

During a write operation, the drive would switch another line indicating it was ready to receive the next data.  The coprocessor would read the next data from memory and transfer it directly to the disc.

 

Local Area Network interfaces are other excellent applications for clocked DMA transfers.

 

 

CLOCKED DMA I/O VERSUS INTERRUPT DRIVEN I/O

 

Interrupts are often used in minicomputer and microprocessor systems to control I/O transfers.  For example, a UART interrupts a microprocessor to indicate a byte of information is ready.  The microprocessor saves its current state and jumps to the interrupt service routine.

 

The routine usually reads the byte, stores it into the receive buffer, and then restores the microprocessor's preinterrupt state.  The overhead of state saving, jumping to the routine, and then returning reduces the efficiency of interrupt initiated transfers by a factor of ten to thirty compared to clocked DMA.

 

Interrupt response time is the greatest I/O limitation in many microprocessors.  The response time of ShBOOM clocked DMA is one memory cycle for the highest priority device (120 nsec.), and a worst case of eight cycles for the lowest priority.

 

The ShBOOM chip offers eight high speed clocked DMA channels as the most efficient method of transferring data to and from peripheral devices.

 

A ShBOOM interrupt waits for the current four byte instruction group to finish executing before responding.  The worst case response time would be encountered if four multiply instructions occurred in a row and the I/O coprocessor made heavy demands on the bus.  It is thereby theoretically possible to delay an internal interrupt by as much as 4 microseconds.

 

 

 


5.5 EXAMPLES OF THE DMA COPROCESSOR USAGE

 

 

 

 

Video Display

 

Embedded controllers are always controlling I/O, often at high speed.  The I/O coprocessor can be programmed to automatically retrieve pixels from a screen buffer, output the pixels to a video shift register, generate sync signals, generate retrace, and continuously repeat until instructed differently by the ShBOOM CPU.

 

Because the coprocessor has bus priority, the display is flicker free regardless of CPU operations including interrupts.

 

 

Hard Disk Controller

 

The coprocessor can read data directly from a hard disk and place it in a temporary buffer.  While the next sector is being read, the CPU can perform error checking or decompress the data and place it in the sector buffer.

 

The same CPU can compress outgoing data, place it in a temporary buffer, and notify the coprocessor.  While the CPU works on compressing the next sector, the coprocessor can position the head, wait for the correct sector, and write out the data.

 

 

LAN Network or PBX Controller

 

As a high-speed data concentrator, the coprocessor can receive input from eight 10 Mbit/sec lines and store the received data in eight different temporary buffers.  The CPU can be notified when any buffer is filled.  Writing out to the lines can be similarly performed.

 

 

Scalable Font Generator

 

In a graphics printer application, the CPU can generate fonts on the fly, place the pixels in a temporary buffer, and notify the coprocessor which outputs them to the print head or laser.  As a result of computing the fonts on the fly, a scalable font printer can be produced with one tenth the memory required by other systems.

 

 

Universal MODEM

 

The coprocessor reads the output from the A/D into a circular temporary buffer.  The CPU can perform the DSP algorithms while A/D input is continuously read.  The CPU can also perform error correcting and decompression.  Output to the D/A can be performed concurrently with input by the coprocessor.